Functions and Character Manipulation

1  Functions

When you create a function, it defines a separate environment, and the variables you create inside the function exist only in that environment; when you return to where you called the function from, those variables no longer exist. You can refer to other objects that exist outside the function (for a function defined at the prompt, the objects in your workspace), but any changes you make to them take place only inside the function environment. To get information back to the calling environment, you must pass a return value, which is the value of the function call itself and can be assigned to a variable. R automatically returns the value of the last expression it evaluates in your function, or you can place the object you want to return in a call to the return function. You can only return a single object from a function in R; to return multiple objects, return a list containing them, and extract them from the list once you are back in the calling environment.
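A minimal sketch of these rules may help. The names counter, bump, localonly and stats are made up for illustration:

```r
counter = 10

bump = function() {
  counter = counter + 1   # makes a local copy; the outer counter is untouched
  localonly = 'temp'      # exists only while the function is running
  counter                 # value of the last expression is returned
}

result = bump()
result                    # 11
counter                   # still 10
exists('localonly')       # FALSE

# Returning several objects at once by wrapping them in a list
stats = function(x) list(lo = min(x), hi = max(x))
s = stats(c(3, 8, 5))
s$lo                      # 3
s$hi                      # 8
```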
As a simple example of a function that returns a value, suppose we want to calculate the ratio of the maximum value of a vector to the minimum value of the vector. Here's a function definition that will do the job:
maxminratio = function(x)max(x)/min(x)

Notice that for a single-line function you don't need braces ({}) around the function body, but you are free to use them if you like. The value of the final statement becomes the return value when the function is called. Alternatively, the value could be placed in a call to the return function. If we wanted to find the max to min ratio for all the columns of a matrix named mymat, we could use our function with the apply function:
apply(mymat,2,maxminratio)

The 2 in the call to apply tells it to operate on the columns of the matrix; a 1 would be used to work on the rows.
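Since mymat is not defined in these notes, here is a small sketch with a made-up matrix showing both margins:

```r
maxminratio = function(x)max(x)/min(x)

# A made-up 3 x 2 matrix; matrix() fills column by column
mymat = matrix(c(1, 2, 4,
                 5, 10, 20), ncol = 2)

bycol = apply(mymat, 2, maxminratio)   # one ratio per column: 4 4
byrow = apply(mymat, 1, maxminratio)   # one ratio per row: 5 5 5
```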
Before we leave this example, it should be pointed out that this function has a weakness - what if we pass it a vector that has missing values? Since we're calling min and max without the na.rm=TRUE argument, we'll always get a missing value if our input data has any missing values. One way to solve the problem is to just put the na.rm=TRUE argument into the calls to min and max. A better way would be to create a new argument with a default value. That way, we still only have to pass one argument to our function, but we can modify the na.rm= argument if we need to.
maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)

If you look at the function definitions for functions in R, you'll see that many of them use this method of setting defaults in the argument list.
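As a quick check of the default argument, here is a sketch using a made-up vector v that contains a missing value:

```r
maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)

v = c(2, NA, 8, 4)
r1 = maxminratio(v)                # 4: the missing value is dropped by default
r2 = maxminratio(v, na.rm=FALSE)   # NA: the missing value propagates
```

The caller only has to mention na.rm when the default is not what they want.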
As another example of a function, recall the graph of income versus literacy with different colored points for the different continents. If we were working with datasets like the world1 dataset and wanted to create a variety of plots, we could write a function like this:
worldplotter = function(data,xvar,yvar,cvar,colors,ltitle=cvar,legendloc='topleft'){
   colorvar = factor(data[,cvar])
   with(data,plot(data[,xvar],data[,yvar],col=colors[colorvar],xlab=xvar,ylab=yvar))
   with(data,legend(legendloc,legend=levels(colorvar),col=colors,pch=1,title=ltitle))
}


Now we could produce the income versus literacy graph by calling:
worldplotter(world1,'literacy','income','cont',c('red','blue','green','orange','yellow','violet'))

By changing the arguments, a variety of plots can be produced.
As your functions get longer and more complex, it becomes more difficult to simply type them into an interactive R session. To make it easy to edit functions, R provides the edit command, which will open an editor appropriate to your operating system. When you close the editor, the edit function will return the edited copy of your function, so it's important to remember to assign the return value from edit to the function's name. If you've already defined a function, you can edit it by simply passing it to edit, as in
maxminratio = edit(maxminratio)

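You may also want to consider the fix function, which automates the process slightly: a call such as fix(maxminratio) opens the editor and assigns the edited result back to the same name for you.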
To start from scratch, you can use a call to edit like this:
newfunction = edit(function(){})

2  Sizes of Objects

Before we start looking at character manipulation, this is a good time to review the different functions that give us the size of an object.
  1. length - returns the length of a vector, or the total number of elements in a matrix (number of rows times number of columns). For a data frame, returns the number of columns.
  2. dim - for matrices and data frames, returns a vector of length 2 containing the number of rows and the number of columns. For a vector, returns NULL. The convenience functions nrow and ncol return the individual values that would be returned by dim.
  3. nchar - for a character string, returns the number of characters in the string. Returns a vector of values when applied to a vector of character strings. For a numeric value, nchar returns the number of characters in the printed representation of the number.
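The three functions can be compared side by side on small made-up objects (v, m and d are illustrative names only):

```r
v = c(10, 20, 30)
m = matrix(1:6, nrow = 2)                  # 2 rows, 3 columns
d = data.frame(a = 1:4, b = c('w', 'x', 'y', 'z'))

length(v)    # 3
length(m)    # 6 (2 rows times 3 columns)
length(d)    # 2 (number of columns)

dim(m)       # 2 3
dim(d)       # 4 2
dim(v)       # NULL
nrow(d)      # 4
ncol(m)      # 3

nchar(c('sherlock', 'up'))   # 8 2
nchar(12345)                 # 5
```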

3  Character Manipulation

While it's quite natural to think of data as being numbers, manipulating character strings is also an important skill when working with data. We've already seen a few simple examples, such as choosing the right format for a character variable that represents a date, or using table to tabulate the occurrences of different character values for a variable. Now we're going to look at some functions in R that let us break apart, rearrange and put together character data.
One of the most important uses of character manipulation is "massaging" data into shape. Many times the data that is available to us, for example on a web page or as output from another program, isn't in a form that a program like R can easily interpret. In cases like that, we'll need to remove the parts that R can't understand, and organize the remaining parts so that R can read them efficiently.
Let's take a look at some of the functions that R offers for working with character variables:
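The listing of functions does not appear at this point in these notes. As a stand-in, and not necessarily the original list, here is a sketch of the base R functions that the rest of this section relies on, applied to a sample string:

```r
s = 'Sherlock Holmes'

nchar(s)                                 # 15
toupper(s)                               # "SHERLOCK HOLMES"
tolower(s)                               # "sherlock holmes"
substring(s, 1, 8)                       # "Sherlock"
strsplit(s, ' ')                         # a list holding c("Sherlock", "Holmes")
paste('Sherlock', 'Holmes')              # "Sherlock Holmes"
paste(c('a', 'b', 'c'), collapse = '')   # "abc"
```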

4  Working with Characters

As you probably noticed when looking at the above functions, they are very simple, and, quite frankly, it's hard to see how they could really do anything complex on their own. In fact, that's just the point of these functions - they can be combined together to do just about anything you would want to do. As an example, consider the task of capitalizing the first character of each word in a string. The toupper function can change the case of all the characters in a string, but we'll need to do something to separate out the characters so we can get the first one. If we call strsplit with an empty string for the splitting character, we'll get back a vector of the individual characters:
> str = 'sherlock holmes'
> letters = strsplit(str,'')
> letters
[[1]]
 [1] "s" "h" "e" "r" "l" "o" "c" "k" " " "h" "o" "l" "m" "e" "s"
> theletters = letters[[1]]

Notice that strsplit always returns a list. This will be very useful later, but for now we'll extract the first element before we try to work with its output.
The places that we'll need to capitalize things are the first position in the vector of letters, and any letter that comes after a blank. We can find those positions very easily:
> wh = c(1,which(theletters == ' ') + 1)
> wh
[1]  1 10

We can change the case of the letters whose indexes are in wh, then use paste to put the string back together.
> theletters[wh] = toupper(theletters[wh])
> paste(theletters,collapse='')
[1] "Sherlock Holmes"

Things have gotten complicated enough that we could probably stand to write a function:
maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

Of course, we should always test our functions:
> maketitle('some crazy title')
[1] "Some Crazy Title"
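Testing is also a good time to try awkward inputs. One case that trips up maketitle as written is a trailing blank: the position after the last blank lies beyond the end of the vector, so a missing value gets appended. A possible hardened variant, under the hypothetical name maketitle_safe, simply drops positions past the end:

```r
maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

# Hypothetical variant: ignore positions beyond the last letter
maketitle_safe = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters == ' ') + 1)
  wh = wh[wh <= length(theletters)]
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

bad  = maketitle('abc ')        # "Abc NA"
good = maketitle_safe('abc ')   # "Abc "
```

The same guard also makes the empty string come back as an empty string.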

Now suppose we have a vector of strings:
> titls = c('sherlock holmes','avatar','book of eli','up in the air')

We can always hope that we'll get the right answer if we just use our function:
> maketitle(titls)
[1] "Sherlock Holmes"

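Unfortunately, it didn't work in this case: because maketitle keeps only the first element of the list returned by strsplit, it converted only the first string. When a function handles a single value but not a whole vector, sapply will apply it to each element in turn: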
> sapply(titls,maketitle)
  sherlock holmes            avatar       book of eli     up in the air 
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air" 

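Another option, sketched here under the hypothetical name maketitles, is to take advantage of the fact that strsplit returns a list with one element per input string, and move the sapply inside the function:

```r
maketitles = function(txt){
  # strsplit gives one vector of letters per input string
  sapply(strsplit(txt,''), function(theletters){
    wh = c(1,which(theletters == ' ') + 1)
    theletters[wh] = toupper(theletters[wh])
    paste(theletters,collapse='')
  })
}

titls = c('sherlock holmes','avatar','book of eli','up in the air')
res = maketitles(titls)
res
```

Because sapply is applied to the list rather than to the character vector, the result here carries no names.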
Of course, this isn't the only way to solve the problem. Rather than break up the string into individual letters, we can break it up into words, and capitalize the first letter of each, then combine them back together. Let's explore that approach:
> str = 'sherlock holmes'
> words = strsplit(str,' ')
> words
[[1]]
[1] "sherlock" "holmes"  

Now we can use the assignment form of the substring function to change the first letter of each word to a capital. Note that we have to make sure to actually return the modified string from our call to sapply, so we ensure that the last statement in our function returns the string:
> sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
  sherlock     holmes 
"Sherlock"   "Holmes" 

Now we can paste the pieces back together to get our answer:
> res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
> paste(res,collapse=' ')
[1] "Sherlock Holmes"
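As an aside, the assignment form of substring is itself vectorized over its first argument, so the same steps can be sketched without the inner sapply (wds is a name chosen here to keep the earlier objects intact):

```r
str = 'sherlock holmes'
wds = strsplit(str,' ')[[1]]

# Replace the first character of every word in one step
substring(wds,1,1) = toupper(substring(wds,1,1))
out = paste(wds,collapse=' ')
out     # "Sherlock Holmes"
```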

To operate on a vector of strings, we'll need to incorporate these steps into a function, and then call sapply:
mktitl = function(str){
   words = strsplit(str,' ')
   res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
   paste(res,collapse=' ')
}

We can test the function, making sure to use a string different from the one we used in our initial test:
> mktitl('some silly string')
[1] "Some Silly String"

And now we can test it on the vector of strings:
> titls = c('sherlock holmes','avatar','book of eli','up in the air')
> sapply(titls,mktitl)
  sherlock holmes            avatar       book of eli     up in the air 
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air" 

How can we compare the two methods? The R function system.time reports the amount of time an R expression takes to evaluate. One important caveat: if you wish to assign the value of an expression to a variable inside the system.time call, you must use the "<-" assignment operator, because an equal sign there is read as a named argument to system.time rather than as an assignment. Let's try system.time on our two functions:
> system.time(one <- maketitle(titls))
   user  system elapsed 
      0       0       0 
> system.time(two <- mktitl(titls))
   user  system elapsed 
  0.000   0.000   0.001 

For such a tiny example, we can't really trust that the difference we see is real. Let's use the movie names from a previous example:
> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> nms = tolower(movies$name)
> system.time(one <- maketitle(nms))
   user  system elapsed 
  0.000   0.000   0.001 
> system.time(two <- mktitl(nms))
   user  system elapsed 
  0.008   0.000   0.007 

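In these timings the first method looks faster than the second. Keep in mind, though, that both functions take only the first element of the list returned by strsplit, so each call above converted just the first string in its input; timing the whole vector means wrapping each function in sapply, as in sapply(nms,maketitle). Of course, if the two methods don't get the same answer, it doesn't really matter how fast they are. In R, the all.equal function can be used to see if things are the same: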
> all.equal(one,two)
[1] TRUE
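A sketch of a comparison that covers every element of the input follows. It assumes a made-up vector nms built by repeating the four sample titles, as a stand-in for the movie names, and wraps each function in sapply. Timings vary by machine, so none are shown:

```r
maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

mktitl = function(str){
   words = strsplit(str,' ')
   res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
   paste(res,collapse=' ')
}

# Made-up stand-in for the movie names: 1000 strings
nms = rep(c('sherlock holmes','avatar','book of eli','up in the air'), 250)

t1 = system.time(one <- sapply(nms, maketitle))
t2 = system.time(two <- sapply(nms, mktitl))

length(one)           # 1000: every string was converted
all.equal(one, two)   # TRUE when both methods agree on every element
```

Checking length(one) before comparing guards against timing a call that quietly handled only part of the input.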



